Note: I’ve left in the code chunks to aid in easy marking, sorry if we were supposed to hide them!
Read in the gapminder_clean.csv data as a
tibble using read_csv
The first step of this analysis is to load in the CSV file we’ll be using - gapminder_clean.csv! In this step we’re also going to rename some of the more verbose columns used throughout this process, to make it easier on ourselves down the line.
gapminder <- read_csv("gapminder_clean.csv")
gapminder <- gapminder %>%
rename(co2 = "CO2 emissions (metric tons per capita)") %>%
rename(energy = "Energy use (kg of oil equivalent per capita)") %>%
rename(import = "Imports of goods and services (% of GDP)") %>%
rename(pop_density = "Population density (people per sq. km of land area)") %>%
rename(country = "Country Name") %>%
rename(life_exp = "Life expectancy at birth, total (years)")Filter the data to include only rows where Year
is 1962 and then make a scatter plot comparing
'CO2 emissions (metric tons per capita)' and
gdpPercap for the filtered data.
Now it’s time to make our first scatter plot! We start by just making a standard scatter plot as seen below.
gapminder %>%
filter(Year == 1962) %>%
ggplot(aes(x = co2, y = gdpPercap)) +
geom_point(color = "lightcoral") +
labs(x = "CO2 emissions (metric tons per capita)", y = "GDP (per capita)",
title = "GDP vs CO2 emissions (with outlier)") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))But this graph looks a little off scale. Upon closer inspection, we can see this is because there’s one datapoint way off in the distance that’s forcing the graph to adjust its scale to accommodate for it. We’ll assume this is an outlier, and filter it out of the dataset. We’ll also add in a line of best fit to get a sense of the correlation between data to prepare for the next question.
#getting rid of the outlier
gapminder_1962 <- gapminder %>%
filter(Year == 1962 & co2 < 20)
gapminder_1962_plot <- gapminder_1962 %>%
ggplot(aes(x = co2, y = gdpPercap)) +
geom_point(color = "lightcoral") +
geom_smooth(method = "lm", se = FALSE, color = "grey0") +
labs(x = "CO2 emissions (metric tons per capita)", y = "GDP (per capita)",
title = "GDP vs CO2 emissions") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_1962_plot)Much better!
On the filtered data, calculate the correlation of
'CO2 emissions (metric tons per capita)' and
gdpPercap. What is the correlation and associated p
value?
We now want to calculate the correlation between these two factors. Correlation essentially shows the strength of a relationship between two variables - a positive correlation score means that as one increases, so does the other - while a negative score means that as one variable increases, the other decreases. Based on the line of best fit in the previous graph, I’m assuming we have a positive correlation. But to be sure, we’ll run a test!
cor_1962 <- gapminder_1962 %>%
with(cor(co2, gdpPercap, use = "complete.obs")) %>%
format(digits = 3)
cor_1962_p <- gapminder_1962 %>%
with(cor.test(co2, gdpPercap, use = "complete.obs")$p.value) %>%
format(digits = 3)The correlation score we got for this relationship is 0.806 which is a fairly strong positive correlation.
We also want to note the p value for this calculation. A super simplified explaination of what the p value is - is how likely it is that the trend we observed was the result of random chance. Generally, if a p value is smaller than 0.05 we consider the chance of this trend occuring randomly to be so small that the results are considered statistically significant. The p value for this test was 1.08e-25, which is < 0.05, meaning it’s significant!
On the unfiltered data, answer “In what year is the
correlation between
'CO2 emissions (metric tons per capita)' and
gdpPercap the strongest?”
To answer this question, we want to map the correlation scores across all the years in the data set (from 1962 - 2007). We do this by making a line graph to visualize this data.
#making the correlation tibble
gapminder_cor <- gapminder %>%
group_by(Year) %>%
summarise(cor = cor(co2, gdpPercap, use = "complete.obs"))
#plotting them
gapminder_cor_plot <- gapminder_cor %>%
ggplot(aes(x = Year, y = cor)) +
geom_line(color = "lightcoral") +
labs(x = "Year", y = "Correlation",
title = "Correlation between CO2 emissions and GDP vs Year") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_cor_plot)#year with highest correlation score
max_cor_year <- gapminder_cor %>%
filter(cor == max(cor)) %>%
pull(Year)From this graph we can see the year where correlation was strongest was 1967!
Using plotly, create an interactive scatter plot
comparing 'CO2 emissions (metric tons per capita)' and
gdpPercap, where the point size is determined by
pop (population) and the color is determined by the
continent.
Time to make another interactive scatter plot as specified above!
gapminder_year_highest_corr <- gapminder %>%
filter(Year == max_cor_year) %>%
ggplot(aes(x = co2, y = gdpPercap, color = continent, size = pop)) +
geom_point() +
labs(x = "CO2 emissions (metric tons per capita)", y = "GDP (per capita)",
title = "Year vs CO2 emissions") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
gapminder_year_highest_corr_plot <- ggplotly(gapminder_year_highest_corr)
gapminder_year_highest_corr_plot %>% layout(legend=list(title=list(text='Continent Population ')))What is the relationship between continent and
'Energy use (kg of oil equivalent per capita)'?
The first place to start when exploring relationships between 2 variables in a data set is through visualization. Given that continent is a categorical variable, and energy use is continuous variable - we’ll use box plot to start.
gapminder_cont_energy <- gapminder %>%
ggplot(mapping = aes(x = continent, y = energy)) +
geom_boxplot(fill = "lightcoral") +
labs(x = "Continent", y = "Energy use (kg of oil equivalent per capita)",
title = "Energy use vs Continent") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_cont_energy)These graphs show a couple of things. First, there’s a bunch of energy results that have no continent attached, and second there seems be have a bunch of outliers - particularly over energy use 8000. This makes sense - as some continents have some very rich countries among a poor majority - for example Signapore in Asia, who you’d expect to use more power. We do want to get rid of some of these outliers though, so we an get the general trend among continents and gain more information from the graph. We’ll run this plot again, but with less outliers, and getting rid of the energy use that have no attached continent.
gapminder_cont_energy_no_out <- gapminder %>%
filter(!is.na(continent) & energy <= 8000)
gapminder_cont_energy_no_out_plot <- gapminder_cont_energy_no_out %>%
ggplot(mapping = aes(x = continent, y = energy)) +
geom_boxplot(fill = "lightcoral") +
labs(x = "Continent", y = "Energy use (kg of oil equivalent per capita)",
title = "Energy use vs Continent (less outliers)") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_cont_energy_no_out_plot)Based on this visual, it seems like the continent does have some influence on the energy use per capita. In particular, looks like Oceania generally has the highest energy use distribution per capita. But we need to use a statistical test to be sure! The country is out predictor variable - as this is the variable that influences the result, and the energy use is our outcome variable, as this is what we measure to determine the relationship. Given we have a categorical predictor variable, and a quantitative outcome variable, and we’re comparing multiple groups (countries) with only one outcome variable (how much energy) - we’ll choose an ANOVA statistical test. The ANOVA test determines is there’s a statistically significant difference between catagorical groups - so it’ll help us determine if there is a relationship between contient and energy use per capita.
continent_energy_aov <- aov(energy ~ continent, data = gapminder_cont_energy_no_out)
summary(continent_energy_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 7.803e+08 195079999 88.6 <2e-16 ***
## Residuals 825 1.816e+09 2201702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_p <- format(summary(continent_energy_aov)[[1]][["Pr(>F)"]][1], digits = 3)This test tells us there’s a significant correlation between the continent and energy use per capita, with a p-value of 1.18e-62!
To see if we can get anymore insights about the nature of this relationship - lets observe the shape of the distribution for each of the continents.
gapminder_cont_energy_shape <- gapminder_cont_energy_no_out %>%
ggplot(mapping = aes(x = energy, fill = continent)) +
geom_density(alpha = .3) +
labs(x = "Energy use (kg of oil equivalent per capita)",
title = "Shape of Energy Use across Continents") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
gapminder_cont_energy_shape_plot <- ggplotly(gapminder_cont_energy_shape)
gapminder_cont_energy_shape_plot %>% layout(legend=list(title=list(text='Continents:')))This gives us some interesting new insights. Africa is very right skewed - meaning all of it’s results are clustered around a low energy per capita. We can also see while the Americas are predominately dense in a low energy use section of the graph, they have a bump near the 8000s - which makes sense as the US presumably would use a lot of power per person. Oceania was more consistent, having less outliers in general and having its distribution centered. This captures the notion that while Oceania would seem to be the continent with the highest energy use overall, other continents have a wide skew, with some counties being overall very low, but certain high energy use outliers that aren’t captured in the data when looking on a continent level.
Is there a significant difference between Europe and Asia
with respect to 'Imports of goods and services (% of GDP)'
in the years after 1990?
As before, because we have a categorical variable and a quantitative variable - we’ll start with a boxplot.
gapminder_europe_asia <- filter(gapminder, (continent == 'Europe' | continent == 'Asia') & Year > 1990)
gapminder_europe_asia_plot <- gapminder_europe_asia %>%
ggplot(mapping = aes(x = continent, y = import)) +
geom_boxplot(fill = "lightcoral") +
labs(x = "Continent", y = "Imports of goods and services (% of GDP)",
title = "Imports of goods and services vs Continent") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_europe_asia_plot)It’s pretty hard to tell if there’s a significant difference between these two variables just from the graph, so we’ll move to a statistical test.
However, this graph does reveal that some of the data points seem to be over 100 - which seems incorrect given this variable is supposed to represent % of GDP. Imports of goods and services can’t make up more than 100% of total GDP, so we’ll filter these results out.
Given we have a categorical predictor variable, and a quantitative outcome variable, and we’re comparing two groups (Asia & Europe), we’ll choose a T-test! T-tests compare the means of two samples, and like the ANOVA test, will help determine if there is a statistically significant difference between the two groups.
gapminder_europe_asia <- filter(gapminder_europe_asia, import <= 100)
gapminder_europe_asia_ttest <- format(t.test(import ~ continent, data = gapminder_europe_asia)$p.value, digits = 3)The p value is 0.984, which is greater than 0.05 - meaning there’s not statistical difference between these variables.
What is the country (or countries) that has the highest
'Population density (people per sq. km of land area)'
across all years?
We’ll approach this by averaging the average population density across all the years, and then only displaying results over 1000 to avoid crowding the graph.
#creating the tibble with mean population data
gapminder_pop <- gapminder %>%
group_by(country) %>%
summarise(pop_mean = mean(pop_density)) %>%
filter(pop_mean > 1000)
#graphing the population data
gapminder_pop_graph <- gapminder_pop %>%
ggplot(aes(x = country, y = pop_mean)) +
geom_bar(stat = "identity", fill = "lightcoral") +
labs(x = "Country", y = "Population density (people per sq. km of land area)",
title = "Population density vs Country") +
theme(axis.title.y = element_text(size = 10, face = "italic"),
axis.title.x = element_text(size = 12, face = "italic"),
axis.text.x = element_text(angle = 50),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_pop_graph)#year with highest population mean
max_pop_year <- gapminder_pop %>%
filter(pop_mean == max(pop_mean)) %>%
pull(country)Seems like the country with the highest average population density across time is Macao SAR, China!
What country (or countries) has shown the greatest increase
in 'Life expectancy at birth, total (years)' since
1962?
We’ll begin approaching this question by finding the country with biggest difference between life expectancy in 1962 & 2007. We’ll only take countries with more than 27 years difference so we don’t crowd the graph.
#making tibble with life exp difference
gapminder_life <- gapminder %>%
arrange(Year) %>%
group_by(country) %>%
summarise(diff = last(life_exp) - first(life_exp)) %>%
filter(diff > 27)
#graphing it
gapminder_life_graph <- gapminder_life %>%
ggplot(aes(x = country, y = diff)) +
geom_bar(stat = "identity", fill = "lightcoral") +
labs(x = "Country", y = "Life expectancy at birth, total (years)",
title = "Life expectancy at birth vs Country") +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_life_graph)#country with greatest life_exp increase
max_life_country <- gapminder_life %>%
filter(diff == max(diff)) %>%
pull(country)Looks like Maldives has the greatest increase in life expectancy! As an exercise, let’s visualize its entire trajectory.
title_str = paste("Life expectancy across time in the", toString(max_life_country), sep=" ")
gapminder_max_life <- gapminder %>%
filter(country == toString(max_life_country)) %>%
ggplot(aes(x = Year, y = life_exp)) +
geom_line(color = "lightcoral") +
labs(x = "Year", y = "Life expectancy at birth, total (years)",
title = title_str ) +
theme(axis.title = element_text(size = 12, face = "italic"),
plot.title = element_text(face = "bold",
hjust = 0.5,
size = 14))
ggplotly(gapminder_max_life)